Abstract: As the one-chip integration of HW modules
designed by different companies becomes more and more
popular, reliability of a HW design and evaluation of
the timing behavior during the prototype stage are
absolutely necessary. One way to guarantee reliability is
the use of robust design styles, e.g., delay-insensitivity.
For early timing evaluation two aspects must be
considered: (a) the timing needs to be proportional to
technology variations, and (b) the implemented
architecture should be identical for prototype and target.
The first can also be met by a delay-insensitive
implementation. The latter is the key point: a unified
architecture is needed for prototyping as well as for
implementation.
Our new approach to rapid prototyping of signal
processing tasks is based on a configurable processor,
implemented in a delay-insensitive style, called FLYSIG.
In essence, the FLYSIG processor can be understood as a
complex FPGA where the CLBs are replaced by bit-serial
operators. In this paper the general concept is detailed
and first experimental results are given to demonstrate
the main advantages: delay-insensitive design style,
direct correspondence between prototyping and target
architecture, high performance, and a reasonable
shortening of the design cycle.

1 Introduction 

Rapid prototyping for automatically generated designs
as well as for manually developed designs has attracted
a lot of interest in recent years [21]. Most approaches
map the system's gate-level netlist onto
field-programmable gate arrays (FPGAs), mainly because
the hardware function is reprogrammable, that is, the
functionality is easy to change. But in many cases a
single FPGA's capacity is not sufficient to cover the
complete synthesized netlist, and an implementation
becomes possible only through additional netlist
partitioning [28]. Besides the computational overhead of
this partitioning, I/O restrictions must be met [13, 30].
Partitioning and I/O routing are both highly dependent
on the FPGA type, the FPGA interconnections, and
the communication protocols. Some providers of
multiple-FPGA boards offer software for netlist
partitioning and generation of communication structures
[30, 13]. But these algorithms do not start from an
abstract gate-level netlist. The netlist must be mapped
onto a concrete gate library known to the provider, e.g.,
the LSI10K library [26], and is then automatically
remapped onto the multiple-FPGA board (figure 1).
Figure 1: Design steps from gate-level netlist to (a)
single FPGA, (b) multiple FPGA board and (c) FLYSIG
processor based implementations.

In other words, for rapid prototyping the gate-level
netlist is mapped to a dedicated FPGA architecture.
Thus elements of the netlist are either decomposed
directly into elements of the FPGA architecture
(figure 1 (a)), or decomposed into elements of a standard
gate library, which are in turn decomposed into elements
of the FPGA architecture (figure 1 (b)). This double
decomposition causes additional costs (number of gate
cells, interconnection). It shows that the advantages of
FPGA technology are paid for with additional design tasks
and with design restrictions that are difficult to meet.

We consider an entirely new approach to rapid
prototyping that addresses the problems mentioned above.
The main idea is to derive a prototyping architecture
from a domain-specific optimized target architecture.
This architecture is implemented as a configurable
processor named the FLYSIG-prototype processor. The
FLYSIG must be provided once as a chip, i.e., as a new
prototyping chip. Figure 2 compares our approach with
the standard approaches described above.
Figure 2: Rapid prototyping approaches: (a)
synchronous, FPGA based and (b) delay-insensitive
FLYSIG-prototype processor based.

The target architecture itself is specialized to the
application domain of fixed digital signal processing
algorithms. It is a well-known strategy to adapt the
design methods to a specific application domain.
Different approaches for partitioning and synthesis as
well as for target architectures have been proposed,
e.g., for control-oriented designs [4, 1], data-flow
oriented designs [14, 3], and real-time constrained
designs [20, 27]. We applied this principle of
specialization to prototyping, i.e., the prototyping
architecture is specialized with respect to the target
architecture. This idea led us to the FLYSIG-approach.
The main advantages are: 

• the elimination of all design tasks related to
FPGA prototyping from the design flow. This shortens
the design cycle drastically. The additionally
introduced design step, which derives the FLYSIG-target
from the FLYSIG-prototype, is an easy-to-automate
task of much lower complexity.

• the delay-insensitive design style used for the
FLYSIG-processor. The well-known gains of
delay-insensitive designs are the elimination of the
clock signal, power savings, and a very robust
modularization. Delay-insensitivity is of major
importance for rapid prototyping because timing analysis
on the prototype within a complex environment is
essential for reliable system validation and short
time-to-market periods. We have examined the
synthesis of delay-insensitive modules [16] and
found that the timing behavior of such modules
can be analyzed at an early design stage, that is, the
technology impact can be approximated quite well.

• the high performance achieved by the
FLYSIG-processor, i.e., sampling rates of 50 MHz and more.



This paper is structured as follows. First, some
background information is provided in section 2. We
present the FLYSIG-processor concept in section 3 and
illustrate its benefits with the fifth-order elliptic
filter example in section 4. Final conclusions are given
in section 5.

2 Related work 

Research in synchronous design methods has taken 
place for several decades. A good summary is given in 
[17]. Basic concepts of asynchronous circuit design 
are presented e.g. in [15]. A lot of effort has been in- 
vested in data protocols and data encoding. 

In [12] a true single-phase protocol has been
presented. The two-phase protocol is used by [7], and
the design of the asynchronous version of the ARM
processor called AMULET1 [29] is also based on the
two-phase protocol. In a later version, the AMULET3
processor, the four-phase protocol has been used [8]
because conversion from the two-phase protocol to the
four-phase protocol is rather costly.

Several data encoding styles are known. Dual-rail
encoding provides two separate data lines, one for the
logic true value and one for the logic false value of a
one-bit data item. This encoding is rather complex, but
hazards cause no problems [7]. Recently, a combination
of single-rail and dual-rail data encoding has been
suggested [18]. One approach to reduce the number of
data lines necessary for dual-rail encoding is bundled
data encoding [22]. A pair of acknowledge/request bits
is added to a set of data bits, called a bundle, to
indicate valid data. Thus the per-bit overhead within
the bundle is eliminated, but the delay of the control
lines must be adapted to the delay of the data lines
[23]. This limited selection of references shows that a
variety of encoding styles and communication protocols
have been developed and are used for
circuits of reasonable complexity.
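
The dual-rail encoding described above can be illustrated
with a minimal Python sketch. The names `encode`, `decode`,
`is_valid` and `EMPTY`, and the choice of the all-zero pair
as the "no data" state, are assumptions made for
illustration; they are not taken from the FLYSIG
implementation.

```python
# Illustrative dual-rail codec (assumed names, not from the paper):
# each bit travels on a (true_rail, false_rail) pair of wires.
EMPTY = (0, 0)  # assumed spacer state: neither rail asserted


def encode(bit):
    """Return the dual-rail pair for one data bit."""
    return (1, 0) if bit else (0, 1)


def is_valid(pair):
    """A pair carries data when exactly one rail is asserted."""
    return pair in ((1, 0), (0, 1))


def decode(pair):
    """Return the bit carried by a pair, or None for the empty state."""
    if pair == EMPTY:
        return None
    if not is_valid(pair):
        raise ValueError("both rails asserted: illegal dual-rail code")
    return 1 if pair == (1, 0) else 0


word = [1, 0, 1, 1]
rails = [encode(b) for b in word]
assert [decode(p) for p in rails] == word
```

Because validity is visible on the data wires themselves,
a receiver in this model needs no separate timing
assumption to know when a bit has arrived.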

Besides data encoding and communication protocols,
design methodologies have attracted a lot of interest.
An overview is given in [11]. In 1989 Sutherland
presented the concept of micropipelines [25], which
has found a lot of interest worldwide. Many
investigations have been based on this concept
[19, 9, 5, 6, 2]. The concept of multi-ring structures
introduced by Staunstrup [24] uses no delay elements but
suffers from the complexity of the generated circuits.
The performance of multi-ring structures is highly
influenced by the availability of data items and of free
places ready to hand data items on. Free places are
commonly called bubbles [10].

In the presented approach, we use the dual-rail data
encoding style and the four-phase data protocol. We
adapted the concept of multi-ring structures and
solved the circuit complexity problem with our own
efficiently implemented operator library based on the
technology described in [16]. Bubbles are integrated
in a fixed manner into the operators. Additional
bubbles are inserted between the operators during the
design process. The operators are the essential part of
the FLYSIG-processor architecture, which is described
in some detail in the next section.
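
To make the combination of dual-rail data and a four-phase
protocol concrete, the following Python sketch lists the
wire states of one possible return-to-zero transfer. The
function names and the single `ack` wire are assumptions
for illustration and do not describe the actual FLYSIG
circuits.

```python
def four_phase_transfer(bits):
    """Yield (data_rails, ack) states for a four-phase, dual-rail transfer."""
    empty = (0, 0)
    for bit in bits:
        code = (1, 0) if bit else (0, 1)
        yield code, 0   # phase 1: sender asserts a valid codeword
        yield code, 1   # phase 2: receiver acknowledges the data
        yield empty, 1  # phase 3: sender returns the rails to empty
        yield empty, 0  # phase 4: receiver releases the acknowledge


def receive(trace):
    """Recover the bits by sampling the rails on each rising edge of ack."""
    bits, prev_ack = [], 0
    for rails, ack in trace:
        if ack == 1 and prev_ack == 0:
            bits.append(1 if rails == (1, 0) else 0)
        prev_ack = ack
    return bits


assert receive(four_phase_transfer([1, 0, 1])) == [1, 0, 1]
```

Each bit costs four wire-state changes in this model, and
the empty state between two codewords plays the role of
the spacer that separates consecutive data items.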

3 The FLYSIG-processor architecture

The FLYSIG-processor architecture allows the efficient
implementation of periodic, a priori fixed algorithms.
Such algorithms are common in digital signal processing
and in real-time controller components, e.g., for
reactive robotic systems. All these algorithms are also
constrained by high sampling rates, which are determined
by rather complex environments. In this section the
architecture itself and its adaptation to prototyping
are presented.

3.1 Overview 

Figure 3 illustrates the dataflow within the
FLYSIG-processor, which is built out of the three
depicted components. In addition, an interconnection to
the environment is provided.
Figure 3: Dataflow within the FLYSIG-processor.



The FLYSIG-processor is initialized by the
configuration and status-control component. The memory
and routing component is interconnected with the
operation component to form a cyclic structure. Within
this structure the multi-ring concept is embedded.
Furthermore, this ring structure is open, that is,
additional FLYSIG-processors can be attached. Thus
processor networks can be built easily.

3.2 Processor concept 

The application which is to be implemented by the
FLYSIG-processor is specified by its control/data flow
graph. Each operation is represented by a node in the
data flow graph. By operation scheduling each data-flow
node is assigned to an operator of the operation
component of the FLYSIG-processor. Then the
interconnection task, known from synchronous design,
must be performed. For the FLYSIG-prototype version
this means initializing the routing component. For the
target version the chosen routing configuration is
hardwired. In figure 4 some details are shown:



(a) The configuration and status-control component
allows convenient configuration of the
FLYSIG-processor. The scheduling information is
fed into the local memory (registers) and into the
routing component for initial operand and result
forwarding. In addition the initial operands are
stored in the memory. All information can be
provided by a configuration host before execution.

(b) The memory and routing component handles
the operands and the computation results. A data
item in the memory is referred to as a token; no
distinction is made between operands and results.
The tokens flow from the memory into the operation
component via the token evaluation. This block
determines whether a memory cell contains a valid
token. If so, the routing block directs the token to
the corresponding operator. Each token consists of
the operation id (identifying the operation), a
valid-flag segment (indicating the availability of
operands) and a guard-flag segment (determining
where the result values are needed).

Figure 4: Concept of the FLYSIG-processor with (a)
the configuration and status component, (b) the
memory and routing component and (c) the
operation component.

(c) The operation component implements the operators
for all possible computations. In the prototype
version a large set of operators is provided in
order to support as many different algorithms as
possible. Operators read the tokens from memory or
from preceding operators. Thus, operator pipelining
is possible. Because of the bit-serial
implementation style this leads to deep pipelines
with very few hazards and thus to high throughput
rates. The computation results are written into the
local registers or directly to the registers of a
further FLYSIG-processor. This is an important
feature which allows the implementation of a single
algorithm to be distributed over several
FLYSIG-processors. Prototype versions may also be
connected with target versions which are already
available. This is of high practical importance
because a stepwise migration from the prototype
environment to the target implementation becomes
feasible.
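
The token format and the token evaluation of item (b) can
be sketched as a small data structure. The field layout
(one Boolean per operand and per destination) and the
names `Token`, `is_valid` and `dispatch` are hypothetical;
the paper specifies only that a token carries an operation
id, a valid-flag segment and a guard-flag segment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    op_id: int         # identifies the operation to execute
    valid: List[bool]  # assumed: one flag per operand, True once available
    guard: List[bool]  # assumed: one flag per destination needing the result


def is_valid(token):
    """Token evaluation: a token is valid when all operands are available."""
    return all(token.valid)


def dispatch(memory):
    """Route valid tokens to their operator (keyed by op_id); keep the rest."""
    routed, waiting = {}, []
    for token in memory:
        if is_valid(token):
            routed.setdefault(token.op_id, []).append(token)
        else:
            waiting.append(token)
    return routed, waiting


memory = [
    Token(op_id=3, valid=[True, True], guard=[True, False]),
    Token(op_id=7, valid=[True, False], guard=[False, True]),
]
routed, waiting = dispatch(memory)
assert list(routed) == [3] and waiting[0].op_id == 7
```

In this reading, a token with a missing operand simply
stays in memory until its remaining valid flag is set.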

3.3 Prototyping 

The prototype version of the FLYSIG-processor differs
from the target version only in the implementation of
the routing component and the complexity of the
operation component. Because the concept of the
FLYSIG-prototype version has been derived from the
target version, the operator concept and the dataflow
remain unchanged. Only the hardwired scheduling
implementation is replaced by a configurable one.
This allows the mapping of several different
algorithms onto the same FLYSIG-prototype processor. In
this context a configurable scheduling can be
implemented by simple associatively controlled crossbar
switches. Furthermore, the FLYSIG-prototype processor
provides a set of the most common operators. This
operator set is only restricted by the design size.
Once an algorithm has been mapped onto the prototype
version and hardware-in-the-loop simulation has been
successful, the FLYSIG-target processor can be derived
easily from the prototype version by eliminating all
unused operators and replacing the configurable
scheduling by a hardwired implementation. These
eliminations and replacements can be
performed automatically.
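
The derivation step described above (drop unused
operators, freeze the routing) is simple enough to sketch
in a few lines of Python. The data model is an assumption
made for illustration: operators are plain names, the
schedule maps data-flow nodes to operator names, and the
routing is a list of (source, destination) pairs. None of
these names come from the FLYSIG tool flow.

```python
def derive_target(prototype_operators, schedule, routing):
    """Return (operators, hardwired_routes) for a target version.

    prototype_operators: operator names available on the prototype
    schedule: dict mapping data-flow node -> operator name
    routing: iterable of (source_operator, destination_operator) pairs
    """
    used = set(schedule.values())
    unknown = used - set(prototype_operators)
    if unknown:
        raise ValueError(f"operators missing from prototype: {sorted(unknown)}")
    # Eliminate all operators that the scheduled algorithm never uses.
    operators = [op for op in prototype_operators if op in used]
    # Replace the configurable routing by a fixed, immutable connection list;
    # routes touching an eliminated operator are dropped as well.
    hardwired = tuple((s, d) for s, d in routing if s in used and d in used)
    return operators, hardwired


ops, wires = derive_target(
    ["add0", "add1", "mul0", "rsel0"],
    {"n1": "add0", "n2": "add1"},
    [("add0", "add1"), ("add1", "add0"), ("mul0", "add0")],
)
assert ops == ["add0", "add1"]
assert wires == (("add0", "add1"), ("add1", "add0"))
```

Returning a tuple for the routes is a deliberate choice in
this sketch: it mirrors the fact that the target routing
can no longer be reconfigured.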

3.4 Operators 

The FLYSIG operation component provides control
operators, storage elements and arithmetic
operators.

3.4.1 Control operators 

For control of delay-insensitive multi-ring based 
architectures several operators have been presented 
by Staunstrup [24] including asymmetric switches, 
join, and fork operators. We extended this set of con- 
trol operators by select operators which allow the 
communication between different rings. The block 
symbol and the register-transfer level netlist of the 
read-select operator are presented in figure 5. 

The RSELECT operator is controlled by the diamond
input, which determines from which input the
data is read. The opposite behavior is realized by the
WSELECT operator, which reads from its only input and
writes to the indicated output. Both operators are
essential to implement control flow.
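
At the behavioral level the two select operators can be
modeled as follows. This is only a functional sketch in
Python: the queues stand in for rings, the integer
`select` stands in for the diamond control input, and all
names are assumptions rather than part of the operator
library.

```python
from collections import deque


def rselect(select, inputs):
    """Read-select: consume one data item from the input chosen by `select`."""
    return inputs[select].popleft()


def wselect(select, item, outputs):
    """Write-select: forward the single input item to the chosen output."""
    outputs[select].append(item)


ring_a, ring_b = deque([10, 11]), deque([20])
out_x, out_y = deque(), deque()

# Move one item from ring_b to out_x, leaving ring_a untouched.
wselect(0, rselect(1, [ring_a, ring_b]), [out_x, out_y])
assert list(out_x) == [20] and list(ring_a) == [10, 11]
```

Only the selected input is consumed, which is what lets
the operators pass data between otherwise independent
rings.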
Figure 5: (a) Block symbol and (b) RT-netlist of
read-select operator.

3.4.2 Storage elements 

Three basic register types are needed: an
uninitialized minimal register, from which the others
are derived, a 0-initialized register and a
1-initialized register. These basic register elements
can be chained to form shift registers. It is important
that for each data bit within the shift register an
extra empty basic register element is provided, so that
optimal throughput can be reached.
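
Why one empty register per data bit is the sweet spot can
be seen in a toy model. The sketch below is not the FLYSIG
circuit: it assumes a closed ring of registers updated in
lock-step, where a token advances only if the next
register is empty, and it measures how many tokens pass
each register per step.

```python
def ring_throughput(stages, tokens, warmup=100, steps=100):
    """Average number of tokens passing each stage per step in a ring."""
    ring = [i < tokens for i in range(stages)]  # True = token, False = bubble
    moved_total = 0
    for t in range(warmup + steps):
        next_ring = list(ring)
        moved = 0
        for i in range(stages):
            j = (i + 1) % stages
            # A token advances only if the next register holds a bubble.
            if ring[i] and not ring[j]:
                next_ring[i], next_ring[j] = False, True
                moved += 1
        ring = next_ring
        if t >= warmup:  # ignore the initial transient
            moved_total += moved
    return moved_total / (steps * stages)


rates = {k: ring_throughput(8, k) for k in (2, 4, 6)}
assert rates[4] > rates[2] and rates[4] > rates[6]
```

In this model an 8-stage ring is fastest with 4 tokens and
4 bubbles; with fewer tokens there is too little data in
flight, and with more tokens the data items block each
other.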

3.4.3 Fixed-point arithmetic operators

Operators for fixed-point arithmetic can be
constructed out of c-gates and synchronous OR gates.
Such circuits can be generated by standard two-level
synthesis techniques whereby the AND-plane is
substituted by a c-gate plane [24]. This design style is
called DIMS (delay-insensitive minterm synthesis). It
leads to a mixture of synchronous and asynchronous
gate-level components and employs large numbers of
c-gates.
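
As an illustration of the DIMS style described above, the
following Python sketch models a two-input dual-rail AND
function with a c-gate plane followed by OR gates. The
classes `CGate` and `DimsAnd` are hypothetical behavioral
models written for this example; they say nothing about
the transistor-level cells used in FLYSIG.

```python
class CGate:
    """Muller c-gate: output follows the inputs when they agree, else holds."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class DimsAnd:
    """Dual-rail AND: one c-gate per input minterm, OR gates on the outputs."""

    def __init__(self):
        self.c = [CGate() for _ in range(4)]

    def update(self, a, b):
        (at, af), (bt, bf) = a, b
        m_tt = self.c[0].update(at, bt)  # a = 1, b = 1
        m_tf = self.c[1].update(at, bf)  # a = 1, b = 0
        m_ft = self.c[2].update(af, bt)  # a = 0, b = 1
        m_ff = self.c[3].update(af, bf)  # a = 0, b = 0
        # (true rail, false rail) of the result
        return (m_tt, m_tf | m_ft | m_ff)


gate = DimsAnd()
assert gate.update((1, 0), (0, 0)) == (0, 0)  # waits for the second operand
assert gate.update((1, 0), (1, 0)) == (1, 0)  # 1 AND 1 = 1
assert gate.update((0, 0), (0, 0)) == (0, 0)  # return to empty
```

The example also shows where the cost comes from: even
this two-input function needs four c-gates, one for every
input combination.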
Figure 6: Add-operator: (a) DIMS and (b) complete
dual-rail implementation.

We build operators based on dual-rail compatible 
implementation of logic gates. This eliminates the 
large number of c-gates and ensures a completely 
time invariant design. In figure 6 both implementa- 
tions are compared. The complexity of a c-gate (cir- 
cle) and a dual-rail AND-gate is comparable. 

From this bit-serial add-operator and several basic
register elements a complete bit-serial full-adder can
be constructed. The RT-level netlist is given in
figure 7.
Figure 7: Complete bit-serial full-adder netlist.
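
The function of a bit-serial full-adder can be stated
compactly in software. The sketch below assumes that
operands arrive least significant bit first and that the
carry is held in a 0-initialized register between bits;
both points are assumptions for illustration, and the
helper names are not taken from the paper.

```python
def to_bits(value, width):
    """Integer -> list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (least significant first) -> integer."""
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equally long bit streams, one bit position per step."""
    carry = 0  # models a 0-initialized carry register
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out  # the final carry is dropped: result is modulo 2**width


assert from_bits(bit_serial_add(to_bits(5, 8), to_bits(9, 8))) == 14
```

Only one add-operator is needed regardless of the word
length, which is the main appeal of the bit-serial style.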

The modularity is quite obvious, and because of the
dual-rail data encoding, delay-insensitivity is
maintained on each hierarchy level. Using this operator
implementation style, further operators have been
implemented and simulated on RT-level. Simulation is
based on a VHDL behavioral description of each basic
entity. Detailed timing data from transistor-netlist
simulation is used within the RT-level VHDL
descriptions, which allows very fast and realistic
evaluation of timing.

3.4.4 Optimization 

The implementation style for FLYSIG-operators allows
substantial optimizations for the implementation of
operation queues. For illustration, we consider the
computation of the term x' = a + x + x + x. A
straightforward implementation is shown in figure 8 (a).
Three full-adders and three basic registers are
allocated.
Figure 8: (a) Straightforward and (b) optimized
implementation of the term x' = a + x + x + x



A much cheaper solution with exactly the same
functionality is given in figure 8 (b). Only two
full-adders and two basic register elements are needed,
whereby one register element has been initialized.
This inserted data item implements a shift operator
at very low cost.
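
One plausible reading of this optimization is that
x + x is replaced by a one-bit shift, so that
x' = a + x + 2x. The Python sketch below checks this
identity under stated assumptions: words are transmitted
least significant bit first with a fixed width, and the
initialized register holds a 0 that leaves the register
before the first data bit. The function names are
illustrative only.

```python
WIDTH = 8  # assumed word length


def to_bits(value):
    """Integer -> list of WIDTH bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(WIDTH)]


def from_bits(bits):
    """List of bits (least significant first) -> integer."""
    return sum(b << i for i, b in enumerate(bits))


def shift_by_initial_zero(bits):
    """Model a 0-initialized register in the data path.

    Its stored 0 is emitted first, so every data bit arrives one
    position later and the most significant bit falls out of the word.
    """
    return [0] + bits[:-1]


def straightforward(a, x):
    return (a + x + x + x) % (1 << WIDTH)  # three additions


def optimized(a, x):
    two_x = from_bits(shift_by_initial_zero(to_bits(x)))
    return (a + x + two_x) % (1 << WIDTH)  # two additions and one shift


assert all(straightforward(a, x) == optimized(a, x)
           for a in range(0, 256, 17) for x in range(256))
```

The shift costs no logic at all in this model: it is just
a data item that is already present in the register when
the computation starts.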

4 Example 

For demonstration of the FLYSIG-processor's concept
and benefits we present an example. It is taken
from the well-known high-level synthesis benchmark
suite. The fifth-order elliptic filter requires
reasonable computation performance and is simple enough
for demonstration. From this filter smaller
subcomponents have been derived for detailed case
studies.
Table 1: Characteristics of filter benchmarks



All filter benchmarks have been mapped onto the
FLYSIG-prototype processor. Based on this prototype,
timing evaluation has been performed. As a first step
we have implemented a VHDL environment for
simulation of the FLYSIG-prototype processor. This
includes VHDL descriptions of all gate-level cells,
operational units and complete operators. The timing
characteristics of the gate-level cells used have been
obtained from analog simulation of the transistor
netlists and were imported into the VHDL
implementations. Based on this two-level simulation,
realistic timing and functional evaluation can be
performed very quickly. All simulations of up to several
thousand ns of FLYSIG-processor execution time could be
performed within a few CPU seconds, which is negligible
compared to other approaches, e.g., those based on
Petri-net simulation or signal transition graph
simulation.

Considering the timing characteristics, the circuit's
latency and throughput are important. We have examined
both for the FLYSIG implementations of the filter
applications of table 1. Latency is determined by the
longest operational path within the circuit. In addition
the number of registers is important because registers
are used to implement bubbles. Thus higher latency
values are found for the same filter functionality
implemented with fewer registers. Of course, this is a
performance/size trade-off. The determined latency
values are depicted in figure 9. But throughput is of
much higher importance because latency can be
regarded as system setup time. Figure 10 shows the best
throughputs for all examined filter applications,
which are reached with an optimal number of
bubble registers.
Figure 9: Latency of filter applications implemented
by FLYSIG-operators.

All filters show the same throughput rate although
the number of operations differs. This is due to the
deeply pipelined operators and the delay-insensitive
design style. The throughput is only restricted by the
operator's throughput, which is rather high because of
the efficient implementation of the dual-rail gates.





Figure 10: Throughput of filter applications
implemented by FLYSIG-operators.

5 Conclusion 

In this paper we presented a new methodology for
rapid prototyping of cyclic signal processing
applications. The FLYSIG processor was developed for
prototyping. From the FLYSIG-prototype implementation
the FLYSIG-target can be derived easily. It has been
shown on the basis of simulation that this prototyping
methodology provides very fast prototype environments
in which hardware-in-the-loop simulation is possible.

Further investigations will include the extension of
the operator library by floating-point operators as well
as by trigonometric operators. The automation of all
design tasks specialized to the FLYSIG-processor is also
under development.